# Configuration for the experiments

We have three categories of experiments: SDMP, post-SDMP classification, and other baselines.

## Configs for SDMP
Configurations are divided into four groups: __general and training__, __theta configs__, __h configs__, and __target configs__. When a parameter is missing, its default value in sdmp_default.yml is assigned.
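
The fallback to default values can be pictured as a dictionary merge. The sketch below is only an illustration: the function name `merge_with_defaults`, the dataset name, and the sample values are assumptions, not the project's actual loader or the real contents of sdmp_default.yml.

```python
import copy


def merge_with_defaults(user_cfg, default_cfg):
    """Return a new config where keys missing from user_cfg fall back to default_cfg."""
    merged = copy.deepcopy(default_cfg)
    for key, value in user_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested groups key by key instead of replacing them wholesale.
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values standing in for sdmp_default.yml.
defaults = {"epoch": 10, "batch_size": 256, "h_lr": 0.01}
# A hypothetical user config that only overrides some parameters.
user = {"name": "cora", "epoch": 50}

config = merge_with_defaults(user, defaults)
```

Here `epoch` comes from the user config, while `batch_size` and `h_lr` fall back to the defaults. The defaults themselves are left unmodified.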

### General and training
* name: dataset name. This needs to be exact, since it influences the data loading APIs.
* epoch: number of epochs to train.
* batch_size: training batch size.
* eval_step: evaluation is performed every __eval_step__ training steps. The evaluated performance might be worse than at the previous evaluation, since the Theta entries that are not in the current mini-batch could end up with a worse h after this mini-batch.
* eval_batch_size: batch size for evaluation when needed.
* feature_normalize: normalization choices for features for each node. 
    * "row_sum": divide each node's features by their sum;
    * "standard": standardize each node feature so that its mean is 0 and its variance is 1;
    * "no": no feature normalization.
* split_seed: random seed for the inductive_train and partial_test split (__not for the train, val, test split__). With a fixed split_seed and the same inductive_train_ratio and partial_test_ratio, the split will be the same.
* inductive_train: whether to train h on only a partial set of nodes instead of all nodes.
* inductive_train_ratio: the proportion of nodes used for training when inductive_train is set to true.
* partial_test: whether to compute the testing results on only a proportion of the nodes. In large graphs, testing on the complete graph could be slow. If this is on, only h is saved after training. The full ThetaT will be computed automatically in the downstream MLP test.
* partial_test_ratio: the proportion of nodes used for testing. When both inductive_train and partial_test are enabled, the sum of inductive_train_ratio and partial_test_ratio should be greater than 1.
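
The three feature_normalize choices can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the function name `normalize_features`, the handling of zero sums and constant columns, and the reading of "standard" as per-column standardization across nodes are all assumptions.

```python
import math


def normalize_features(features, mode):
    """Normalize a list of per-node feature rows according to feature_normalize."""
    if mode == "no":
        return [list(row) for row in features]
    if mode == "row_sum":
        out = []
        for row in features:
            total = sum(row)
            # Assumption: rows that sum to zero are left unchanged to avoid dividing by zero.
            out.append([x / total for x in row] if total != 0 else list(row))
        return out
    if mode == "standard":
        n = len(features)
        dims = len(features[0])
        means = [sum(row[j] for row in features) / n for j in range(dims)]
        stds = [
            math.sqrt(sum((row[j] - means[j]) ** 2 for row in features) / n)
            for j in range(dims)
        ]
        # Assumption: constant columns (standard deviation 0) are only centered, not scaled.
        return [
            [(row[j] - means[j]) / stds[j] if stds[j] > 0 else row[j] - means[j]
             for j in range(dims)]
            for row in features
        ]
    raise ValueError(f"unknown feature_normalize mode: {mode!r}")


# Three nodes with two features each.
x = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
row_sum = normalize_features(x, "row_sum")
standard = normalize_features(x, "standard")
```

With "row_sum", every row of the result sums to 1; with "standard", every column of the result has mean 0 and variance 1.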

### theta configs
* theta_n_nonzeros: the maximum number of iterations in the LAR algorithm, which is the upper bound on the number of nonzero entries of theta for each node. Usually the actual number of nonzeros is much smaller than this upper bound.
* theta_cand_mode: the mode used to generate the candidates for theta.
    * "full": use all nodes as candidates;
    * "dense": use the __theta_cand_k1__-th power of the adjacency matrix as the candidates;
    * "sparse": use a random walk of order __theta_cand_k2__ with fanout __theta_cand_fanout__ to determine the candidate sets, with importance sampling;
    * "mixed": a mix of "dense" and "sparse".
* theta_cand_k2: the random walk depth for generating the candidates in the "sparse" and "mixed" modes.
* theta_cand_fanout: a list of fanouts for the random walk in the "sparse" and "mixed" modes. The right-most number indicates the number of neighbours sampled for the output nodes.
* theta_cand_k1: the power of the adjacency matrix in the "dense" and "mixed" modes. Choose 1-3 for small graphs (fewer than 10k nodes), and choose 1 for large graphs for efficiency.
* theta_cand_add_self: whether to add the node itself as a candidate. It is suggested to always turn this on.
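
To make the "dense" mode concrete, the sketch below collects, for each node, the nodes that have a nonzero entry in the k1-th power of the adjacency matrix, which are the nodes reachable by a walk of exactly k1 edges. It is an illustration only: the function name `dense_candidates` and the adjacency-dict representation are assumptions, and it uses an unweighted adjacency matrix without self-loops, which may differ from what the project does.

```python
def dense_candidates(adj, k1, add_self=True):
    """For every node, return the nodes with a nonzero entry in the k1-th adjacency power.

    adj maps each node to the set of its neighbours.
    """
    candidates = {}
    for node in adj:
        frontier = {node}
        # After i iterations, frontier holds the nodes reachable by a walk of exactly i edges.
        for _ in range(k1):
            frontier = {nbr for cur in frontier for nbr in adj[cur]}
        if add_self:
            # Mirrors theta_cand_add_self: the node is always its own candidate.
            frontier = frontier | {node}
        candidates[node] = frontier
    return candidates


# A 4-node path graph: 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
one_hop = dense_candidates(adj, k1=1)
two_hop = dense_candidates(adj, k1=2, add_self=False)
```

The candidate sets grow quickly with k1 on real graphs, which is consistent with the advice to keep k1 at 1 for large graphs.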

### h configs
* h_init_theta_mode: the method used to generate the initial solution of theta, which is used to train the initial value of h.
    * "dense": use the __h_init_theta_k__-th power of the normalized adjacency matrix;
    * "sparse": use a random walk with importance sampling, with depth __h_init_theta_k__ and fanout __h_init_theta_k_fanout__.
* h_init_theta_k: the power of the adjacency matrix ("dense" mode) or the depth of the random walk ("sparse" mode).
* h_init_theta_k_fanout: a list of fanouts for the "sparse" mode. The right-most number indicates the number of neighbours sampled for the output nodes.
* h_init_epoch: number of training epochs during the initialization of h. For a value less than 1, only a randomly sampled h_init_epoch portion of the training samples is used for the initialization of h. A value larger than 1 should be an integer. The number of samples per epoch is affected by __inductive_train__ and __inductive_train_ratio__.
* h_hidden: hidden layers of the h model. With a blank list, the model is a 1-layer MLP; otherwise, one hidden layer is added for each number in the list, with that number as its hidden dimension. Usually a blank list yields the best performance.
* h_loop_cnt: the number of updates of h for one mini-batch during the main loop.
* h_lr: learning rate of h model.
* h_l2: weight decay for h model.
* h_dropout: dropout ratio for h model.
* h_extra_sample: whether to add extra samples besides the current batch of samples to train h. 
* h_extra_sample_size: the size of the randomly added nodes to train h. 
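
The semantics of h_hidden and h_init_epoch can be sketched with two small helpers. Both function names, the input and output dimensions, and the rounding of the sample count are assumptions made for illustration; they are not taken from the project's code.

```python
def mlp_layer_shapes(in_dim, out_dim, h_hidden):
    """Return the (input, output) size of each linear layer implied by h_hidden."""
    dims = [in_dim, *h_hidden, out_dim]
    return list(zip(dims[:-1], dims[1:]))


def init_sample_plan(h_init_epoch, n_train):
    """Return (number of passes, samples per pass) for the initialization of h."""
    if h_init_epoch < 1:
        # A fraction means a single pass over a random subset of the training samples.
        # Assumption: the subset size is rounded down.
        return 1, int(n_train * h_init_epoch)
    if h_init_epoch != int(h_init_epoch):
        raise ValueError("h_init_epoch larger than 1 should be an integer")
    return int(h_init_epoch), n_train


single_layer = mlp_layer_shapes(128, 64, [])
three_layers = mlp_layer_shapes(128, 64, [256, 256])
partial_pass = init_sample_plan(0.25, 1000)
full_passes = init_sample_plan(3, 1000)
```

A blank h_hidden gives a single linear layer from input to output, while `[256, 256]` gives three linear layers. An h_init_epoch of 0.25 uses a quarter of the training samples once, while 3 uses all of them three times.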

### target configs
The configuration for the target GNN. 
* target_h_mode: currently two modes are implemented.
    * "internal": the internal mode, which uses SAGE models trained with script/baseline_SAGE_test.ipynb. During SDMP training, the targets are computed during preprocessing.
    * "external": the external mode. Targets are precomputed and stored.
* target_h_model: name of the target h model. This influences the cache path for the preprocessing results.
* target_h_model_path: the file name for the target h model (internal mode) or the saved target embeddings (external mode).
* target_h_model_conf_path: the training configuration of the target model. Only the internal mode uses this parameter.
* target_h_model_metric_path: the file name to store the string of the testing results for the target model. 

The folder containing the above three paths is specified by the --gnn option in main_gen_SDMP.py and main_test_SDMP.py.
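
As a rough sketch of how the three file names combine with the --gnn folder, the helper below joins them into full paths. The function name `resolve_target_paths`, the folder, and the file names are hypothetical; the scripts may build these paths differently.

```python
from pathlib import PurePosixPath


def resolve_target_paths(gnn_dir, cfg):
    """Join the --gnn folder with the three target_h_model_* file names."""
    keys = (
        "target_h_model_path",
        "target_h_model_conf_path",
        "target_h_model_metric_path",
    )
    # Keys that are absent from the config are skipped.
    return {key: str(PurePosixPath(gnn_dir) / cfg[key]) for key in keys if key in cfg}


# Hypothetical folder and file names, for illustration only.
cfg = {
    "target_h_model_path": "sage_model.pt",
    "target_h_model_conf_path": "sage_conf.yml",
    "target_h_model_metric_path": "sage_metric.txt",
}
paths = resolve_target_paths("saved/gnn", cfg)
```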
